Smart Paper Notes

UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units

Hirofumi Inaguma , Sravya Popuri , Ilia Kulikov , Peng-Jen Chen , Changhan Wang , Yu-An Chung , Yun Tang , Ann Lee , Shinji Watanabe , Juan Pino

Category: Natural Language Processing

2022-12-15

Direct speech-to-speech translation (S2ST), in which all components can be optimized jointly, is advantageous over cascaded approaches to achieve fast inference with a simplified pipeline. We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and predicts discrete acoustic units subsequently. We enhance the model performance by subword prediction in the first-pass decoder, advanced two-pass decoder architecture design and search strategy, and better training regularization. To leverage large amounts of unlabeled text data, we pre-train the first-pass text decoder based on the self-supervised denoising auto-encoding task. Experimental evaluations on benchmark datasets at various data scales demonstrate that UnitY outperforms a single-pass speech-to-unit translation model by 2.5-4.2 ASR-BLEU with 2.83x decoding speed-up. We show that the proposed methods boost the performance even when predicting a spectrogram in the second pass. However, predicting discrete units achieves 2.51x decoding speed-up compared to that case.

translated by Google Translate

Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation

Sravya Popuri , Peng-Jen Chen , Changhan Wang , Juan Pino , Yossi Adi , Jiatao Gu , Wei-Ning Hsu , Ann Lee

Category: Natural Language Processing

2022-04-06

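Direct speech-to-speech translation (S2ST) models suffer from data scarcity, because little parallel S2ST data exists compared to the amount of data available for conventional cascaded systems, which consist of automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) synthesis. In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue. We take advantage of the recently proposed speech-to-unit translation (S2UT) framework, which encodes target speech into discrete representations, and transfer pre-training and efficient partial fine-tuning techniques that work well for speech-to-text translation (S2T) to the S2UT domain by studying both speech encoder and discrete unit decoder pre-training. Our experiments on Spanish-English translation show that self-supervised pre-training consistently improves model performance compared with multitask learning, with an average gain of 6.6-12.1 BLEU, and that it can be further combined with data augmentation techniques that apply MT to create weakly supervised training data. Audio samples are available at: https://facebookresearch.github.io/speech_translation/enhanced_direct_s2st_units/index.html.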


Textless Speech-to-Speech Translation on Real Data

Ann Lee , Hongyu Gong , Paul-Ambroise Duquenne , Holger Schwenk , Peng-Jen Chen , Changhan Wang , Sravya Popuri , Juan Pino , Jiatao Gu , Wei-Ning Hsu

Category: Natural Language Processing | Artificial Intelligence | Machine Learning

2021-12-15

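We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another and can be built without any text data. Unlike existing work in the literature, we tackle the challenge of modeling multi-speaker target speech and train the systems with real-world S2ST data. The key to our approach is a self-supervised unit-based speech normalization technique, which fine-tunes a pre-trained speech encoder with paired audio from multiple speakers and a single reference speaker to reduce the variations due to accents while preserving the lexical content. With only 10 minutes of paired data for speech normalization, we obtain an average gain of 3.2 BLEU when training the S2ST model on the \vp~S2ST dataset, compared to a baseline trained on un-normalized speech targets. We also incorporate automatically mined S2ST data and show an additional 2.0 BLEU gain. To our knowledge, we are the first to establish a textless S2ST technique that can be trained with real-world data and that works for multiple language pairs.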


Direct Simultaneous Speech-to-Speech Translation with Variational Monotonic Multihead Attention

Xutai Ma , Hongyu Gong , Danni Liu , Ann Lee , Yun Tang , Peng-Jen Chen , Wei-Ning Hsu , Phillip Koehn , Juan Pino

Category: Natural Language Processing

2021-10-15

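We present a direct simultaneous speech-to-speech translation (Simul-S2ST) model in which the generation of the translation is independent of intermediate text representations. Our approach leverages recent progress on direct speech-to-speech translation with discrete units, in which a sequence of discrete representations learned in an unsupervised manner, instead of continuous spectrogram features, is predicted by the model and passed directly to a vocoder for on-the-fly speech synthesis. We also introduce variational monotonic multihead attention (V-MMA) to handle the challenge of inefficient policy learning in simultaneous speech translation. The simultaneous policy then operates on source speech features and target discrete units. We carry out empirical studies to compare cascaded and direct approaches on the Fisher Spanish-English and MuST-C English-Spanish datasets. The direct simultaneous model is shown to outperform the cascaded model by achieving a better trade-off between translation quality and latency.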


Generative appearance replay for continual unsupervised domain adaptation

Boqi Chen , Kevin Thandiackal , Pushpak Pati , Orcun Goksel

Category: Computer Vision | Artificial Intelligence

2023-01-03

Deep learning models can achieve high accuracy when trained on large amounts of labeled data. However, real-world scenarios often involve several challenges: Training data may become available in installments, may originate from multiple different domains, and may not contain labels for training. Certain settings, for instance medical applications, often involve further restrictions that prohibit retention of previously seen data due to privacy regulations. In this work, to address such challenges, we study unsupervised segmentation in continual learning scenarios that involve domain shift. To that end, we introduce GarDA (Generative Appearance Replay for continual Domain Adaptation), a generative-replay based approach that can adapt a segmentation model sequentially to new domains with unlabeled data. In contrast to single-step unsupervised domain adaptation (UDA), continual adaptation to a sequence of domains enables leveraging and consolidation of information from multiple domains. Unlike previous approaches in incremental UDA, our method does not require access to previously seen data, making it applicable in many practical scenarios. We evaluate GarDA on two datasets with different organs and modalities, where it substantially outperforms existing techniques.


MGTAB: A Multi-Relational Graph-Based Twitter Account Detection Benchmark

Shuhao Shi , Kai Qiao , Jian Chen , Shuai Yang , Jie Yang , Baojie Song , Linyuan Wang , Bin Yan

Category: Computer Vision

2023-01-03

The development of social media user stance detection and bot detection methods relies heavily on large-scale and high-quality benchmarks. However, in addition to low annotation quality, existing benchmarks generally have incomplete user relationships, suppressing graph-based account detection research. To address these issues, we propose a Multi-Relational Graph-Based Twitter Account Detection Benchmark (MGTAB), the first standardized graph-based benchmark for account detection. To our knowledge, MGTAB was built based on the largest original data in the field, with over 1.55 million users and 130 million tweets. MGTAB contains 10,199 expert-annotated users and 7 types of relationships, ensuring high-quality annotation and diversified relations. In MGTAB, we extracted the 20 user property features with the greatest information gain and user tweet features as the user features. In addition, we performed a thorough evaluation of MGTAB and other public datasets. Our experiments found that graph-based approaches are generally more effective than feature-based approaches and perform better when introducing multiple relations. By analyzing experiment results, we identify effective approaches for account detection and provide potential future research directions in this field. Our benchmark and standardized evaluation procedures are freely available at: https://github.com/GraphDetec/MGTAB.
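
The abstract mentions keeping the user property features with the greatest information gain but does not say how the gain is estimated. The following is a minimal sketch of one common way to do such a ranking, assuming discrete features and the entropy-based definition H(labels) - H(labels | feature). The function names, the toy feature names, and the toy data are all hypothetical and are not taken from MGTAB.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """H(labels) - H(labels | feature) for one discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Entropy of the labels within each feature value, weighted by group size.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def top_k_features(features, labels, k):
    """Names of the k features with the highest information gain."""
    ranked = sorted(
        features,
        key=lambda name: information_gain(features[name], labels),
        reverse=True,
    )
    return ranked[:k]


# Toy data (hypothetical): label 1 = bot, label 0 = human.
labels = [1, 1, 0, 0]
features = {
    "default_profile_image": [1, 1, 0, 0],  # separates the two classes perfectly
    "verified": [0, 1, 0, 1],               # carries no information about the label
}
best = top_k_features(features, labels, k=1)
```

Continuous properties such as follower counts would need to be binned first (or scored with a different estimator); that choice is left open here.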


Explaining Imitation Learning through Frames

Boyuan Zheng , Jianlong Zhou , Chunjie Liu , Yiqiao Li , Fang Chen

Category: Machine Learning | Computer Vision

2023-01-03

As one of the prevalent methods for building automation systems, Imitation Learning (IL) shows promising performance in a wide range of domains. However, despite the considerable improvement in policy performance, the corresponding research on the explainability of IL models is still limited. Inspired by recent approaches in explainable artificial intelligence, we propose a model-agnostic explanation framework for IL models called R2RISE. R2RISE aims to explain the overall policy performance with respect to the frames in demonstrations. It iteratively retrains the black-box IL model on randomly masked demonstrations and uses the environment returns from conventional evaluation as coefficients to build an importance map. We also conduct experiments to investigate three major questions concerning the equality of frames' importance, the effectiveness of the importance map, and connections between importance maps from different IL models. The results show that R2RISE successfully distinguishes important frames from the demonstrations.
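
The masking-and-retraining loop described above can be illustrated with a small sketch. This is not the authors' implementation: the per-frame normalization, the parameter names, and the `train_and_evaluate` callback are assumptions, and the toy callback below only stands in for the expensive step of retraining an IL model and measuring its environment return.

```python
import random


def frame_importance(num_frames, train_and_evaluate, num_masks=2000, keep_prob=0.5, seed=0):
    """Random-masking importance map over demonstration frames.

    train_and_evaluate(mask) is a hypothetical callback: it should retrain the
    imitation-learning model on the frames where mask[i] == 1 and return the
    environment return of the resulting policy.
    """
    rng = random.Random(seed)
    weighted = [0.0] * num_frames
    counts = [0] * num_frames
    for _ in range(num_masks):
        mask = [1 if rng.random() < keep_prob else 0 for _ in range(num_frames)]
        ret = train_and_evaluate(mask)
        # The return acts as the coefficient of every frame this mask kept.
        for i, kept in enumerate(mask):
            if kept:
                weighted[i] += ret
                counts[i] += 1
    # Average return over the masks that kept each frame.
    return [w / c if c else 0.0 for w, c in zip(weighted, counts)]


def toy_train_and_evaluate(mask):
    """Stand-in for retraining: pretend only frames 2 and 5 affect the return."""
    return 10.0 * mask[2] + 5.0 * mask[5]


importance = frame_importance(8, toy_train_and_evaluate)
most_important = max(range(8), key=lambda i: importance[i])
```

In the toy setup, frame 2 contributes most to the return, so it should receive the largest score, with frame 5 ranked above the frames that do not affect the return at all.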


Saliency-Aware Spatio-Temporal Artifact Detection for Compressed Video Quality Assessment

Liqun Lin , Yang Zheng , Weiling Chen , Chengdong Lan , Tiesong Zhao

Category: Computer Vision

2023-01-03

Compressed videos often exhibit visually annoying artifacts, known as Perceivable Encoding Artifacts (PEAs), which dramatically degrade video visual quality. Subjective and objective measures capable of identifying and quantifying various types of PEAs are critical in improving visual quality. In this paper, we investigate the influence of four spatial PEAs (i.e. blurring, blocking, bleeding, and ringing) and two temporal PEAs (i.e. flickering and floating) on video quality. For spatial artifacts, we propose a visual saliency model with a low computational cost and higher consistency with human visual perception. In terms of temporal artifacts, self-attention based TimeSFormer is improved to detect temporal artifacts. Based on the six types of PEAs, a quality metric called Saliency-Aware Spatio-Temporal Artifacts Measurement (SSTAM) is proposed. Experimental results demonstrate that the proposed method outperforms state-of-the-art metrics. We believe that SSTAM will be beneficial for optimizing video coding techniques.


Risk-Averse MDPs under Reward Ambiguity

Haolin Ruan , Zhi Chen , Chin Pang Ho

Category: Machine Learning

2023-01-03

We propose a distributionally robust return-risk model for Markov decision processes (MDPs) under risk and reward ambiguity. The proposed model optimizes the weighted average of mean and percentile performances, and it covers the distributionally robust MDPs and the distributionally robust chance-constrained MDPs (both under reward ambiguity) as special cases. By considering that the unknown reward distribution lies in a Wasserstein ambiguity set, we derive the tractable reformulation for our model. In particular, we show that the return-risk model can also account for risk from an uncertain transition kernel when one only seeks deterministic policies, and that a distributionally robust MDP under the percentile criterion can be reformulated as its nominal counterpart at an adjusted risk level. A scalable first-order algorithm is designed to solve large-scale problems, and we demonstrate the advantages of our proposed model and algorithm through numerical experiments.
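
To make the "weighted average of mean and percentile performances" concrete, here is a minimal sketch that scores a policy from sampled returns. It covers only the nominal, sample-based objective: the Wasserstein ambiguity set and the paper's reformulation are not modeled, and the function names, the quantile convention, and the toy numbers are assumptions made for illustration.

```python
import math


def empirical_percentile(samples, level):
    """Smallest sample x such that at least a `level` fraction of samples are <= x."""
    ordered = sorted(samples)
    index = max(0, math.ceil(level * len(ordered)) - 1)
    return ordered[index]


def return_risk_score(returns, weight, level):
    """weight * mean(returns) + (1 - weight) * level-percentile(returns)."""
    mean = sum(returns) / len(returns)
    return weight * mean + (1.0 - weight) * empirical_percentile(returns, level)


# Hypothetical sampled returns of two policies.
risky = [0.0, 2.0, 4.0, 10.0]   # higher mean, poor worst case
steady = [3.0, 3.5, 4.0, 4.5]   # lower mean, good worst case

mean_only = (return_risk_score(risky, 1.0, 0.25), return_risk_score(steady, 1.0, 0.25))
balanced = (return_risk_score(risky, 0.5, 0.25), return_risk_score(steady, 0.5, 0.25))
```

With all the weight on the mean, the risky policy scores higher (4.0 versus 3.75); with equal weight on the 25th percentile, the steady policy wins (3.375 versus 2.0), which is the trade-off the weighting parameter is meant to control.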


Policy Pre-training for End-to-end Autonomous Driving via Self-supervised Geometric Modeling

Penghao Wu , Li Chen , Hongyang Li , Xiaosong Jia , Junchi Yan , Yu Qiao

Category: Computer Vision

2023-01-03

Witnessing the impressive achievements of pre-training techniques on large-scale data in the fields of computer vision and natural language processing, we wonder whether this idea could be adapted in a grab-and-go spirit to mitigate the sample inefficiency problem for visuomotor driving. Given the highly dynamic and variant nature of the input, the visuomotor driving task inherently lacks view and translation invariance, and the visual input contains massive irrelevant information for decision making, which makes predominant pre-training approaches from general vision less suitable for the autonomous driving task. To this end, we propose PPGeo (Policy Pre-training via Geometric modeling), an intuitive and straightforward fully self-supervised framework curated for policy pre-training in visuomotor driving. We aim at learning policy representations as a powerful abstraction by modeling 3D geometric scenes on large-scale unlabeled and uncalibrated YouTube driving videos. The proposed PPGeo is performed in two stages to support effective self-supervised training. In the first stage, the geometric modeling framework generates pose and depth predictions simultaneously, with two consecutive frames as input. In the second stage, the visual encoder learns driving policy representation by predicting the future ego-motion and optimizing with the photometric error based on current visual observation only. As such, the pre-trained visual encoder is equipped with rich driving policy related representations and is thereby competent for multiple visuomotor driving tasks. Extensive experiments covering a wide span of challenging scenarios have demonstrated the superiority of our proposed approach, where improvements range from 2% to even over 100% with very limited data. Code and models will be available at https://github.com/OpenDriveLab/PPGeo.
